A Bayesian Spatial Scan Statistic for Under-reported Data

August 14, 2025

Table of Contents

  • Introduction
  • Proposed Method
  • Simulation Study
  • Application: Texas COVID-19 Data
  • Discussion and Conclusion

Introduction

Public Health Surveillance

Public health surveillance
The systematic, ongoing assessment of the health of a community, including the timely collection, analysis, interpretation, dissemination, and subsequent use of data.

Outbreak Detection

A subset of disease surveillance methods focuses on tracking disease progression and identifying hotspots.

Novel disease monitoring

New diseases often lack reliable testing and reporting systems. Early cases may be missed or misclassified, which undermines surveillance techniques that assume complete case counts.

Examples

  • COVID-19
  • HIV/AIDS
  • Tuberculosis (TB)

Accounting for Under-reporting

Most methods proposed for modeling under-reported or misclassified data fall into two categories:

  1. Double sampling
    • Use multiple data sources to augment under-reported or misreported data and reduce bias in parameter estimates
    • Utilize two independent data collection mechanisms with different reporting characteristics
  2. Latent variable models
    • Models that explicitly account for the reporting process itself
    • Almost exclusively Bayesian models

Spatial Scan Statistics

General Concept

Scan statistics

  1. Select candidate regions
  2. Calculate relative risk inside and outside of candidate region
  3. Determine region with largest difference
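The first step can be sketched as follows. This is a minimal illustration that assumes candidate regions are built by growing each location's set of nearest neighbours, a common choice for circular scans; the talk does not state how its candidate regions were constructed, and the function name `candidate_zones` and the centroids are hypothetical.

```python
import math


def candidate_zones(centroids, max_size):
    """Build candidate zones as sets of location indices.

    For every location, zones are formed from its k nearest locations
    (the location itself included) for k = 1, ..., max_size.
    """
    zones = set()
    for xi, yi in centroids:
        # Order all locations by distance from the current centre.
        order = sorted(
            range(len(centroids)),
            key=lambda j: math.hypot(centroids[j][0] - xi, centroids[j][1] - yi),
        )
        for k in range(1, max_size + 1):
            zones.add(frozenset(order[:k]))
    return zones


# Hypothetical centroids for four locations along a line.
centroids = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (6.0, 0.0)]
zones = candidate_zones(centroids, max_size=2)
```

Storing zones as frozensets removes duplicates that arise when two centres generate the same set of locations.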

Visualization of Spatial Scan Regions

Scan Statistic Development

timeline
    title Spatial Scan Statistic Development
    1965 : Conceptual basis - Naus
    1997 : Basic Spatial Scan Statistic (Frequentist)
    1998 : Space-Time Extension (Frequentist)
    2005 : Flexible Shapes (Frequentist)
    2005 : Bayesian Spatial Scan Statistic
    2007 : Multivariate Spatial Scan Statistic (Frequentist)
    2012 : Overdispersed data extension (Frequentist) 
    2017 : Bayesian Spatial Scan Statistic for Zero-inflated count data
    2018 : Wald-based Spatial Scan Statistics (Frequentist)
    2024 : Bayesian Spatial Scan Statistic for Multinormal data

  • Since their formalization in 1997, spatial scan statistics have been used and described as a tool for epidemiologists
  • No existing extension accounts for under-reported count data

Frequentist Spatial Scan Statistic

  • The framework assumes that we observe counts \(z_i\) such that \(z_i \sim \text{Poisson}(qb_i)\)
    • Where \(b_i\) represents the known baseline/at risk population of location \(i\)
    • \(q\) is the unknown underlying disease rate

\[ H_0: \text{No cluster (common rate for all regions)} \\ H_1(S): \text{Cluster in subset }S\text{ with elevated rate vs. outside } S \]

  • Compute the likelihood ratio test statistic \(\lambda(S)\) for each candidate zone \(S\)
  • The scan test statistic is \(\Lambda = \max_{S \in C}\lambda(S)\), where \(C\) is the set of candidate zones.
  • Generate Monte Carlo samples under \(H_0\) to calculate the p-value
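The three steps above can be sketched in a few lines. The sketch assumes the usual Poisson log-likelihood ratio with rates estimated by maximum likelihood inside and outside the zone, counts only zones with an elevated inside rate, and simulates null data by redistributing the observed total across locations in proportion to the baselines. The data and zone list are hypothetical.

```python
import math
import random


def poisson_llr(z, b, zone):
    """Log-likelihood ratio for one zone under z_i ~ Poisson(q * b_i)."""
    total_c, total_b = sum(z), sum(b)
    c_in = sum(z[i] for i in zone)
    b_in = sum(b[i] for i in zone)
    c_out, b_out = total_c - c_in, total_b - b_in
    # Only an elevated rate inside the zone counts as evidence for H1(S).
    if b_out == 0 or c_in / b_in <= c_out / b_out:
        return 0.0

    def term(c, base):
        return c * math.log(c / base) if c > 0 else 0.0

    return term(c_in, b_in) + term(c_out, b_out) - term(total_c, total_b)


def scan_statistic(z, b, zones):
    """Largest log-likelihood ratio over all candidate zones."""
    return max(poisson_llr(z, b, zone) for zone in zones)


def monte_carlo_p_value(z, b, zones, n_sim=199, seed=0):
    """Monte Carlo p-value, conditioning on the observed total count."""
    rng = random.Random(seed)
    observed = scan_statistic(z, b, zones)
    exceed = 0
    for _ in range(n_sim):
        sim = [0] * len(b)
        # Under H0 each case falls in location i with probability b_i / B.
        for i in rng.choices(range(len(b)), weights=b, k=sum(z)):
            sim[i] += 1
        if scan_statistic(sim, b, zones) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)


# Hypothetical data: four locations with equal baselines, excess in location 0.
z = [30, 5, 6, 4]
b = [10.0, 10.0, 10.0, 10.0]
zones = [{0}, {1}, {2}, {3}, {0, 1}, {1, 2}, {2, 3}]
p_value = monte_carlo_p_value(z, b, zones)
```

The `+ 1` in the numerator and denominator treats the observed data as one of the null replicates, so the estimated p-value is never exactly zero.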

Bayesian Spatial Scan Statistics

  • Assuming we observe count data \(z_i\) in area \(i\), each associated with baseline \(b_i\)
  • Under the null hypothesis there is no cluster and all locations share \(q_{all}\) \[ z_i \sim \text{Poisson}(q_{all} b_i), \quad q_{all} \sim \text{gamma}(\alpha_{all}, \beta_{all})\]
  • The alternative hypothesis for each candidate cluster \(S \in \mathcal{S}\), where \(\mathcal{S}\) is the space of all possible clusters \[ \begin{cases} z_i \sim \text{Poisson}(q_{in} b_i), &i \in S, \quad q_{in} \sim \text{Gamma}(\alpha_{in}, \beta_{in}), \\ z_i \sim \text{Poisson}(q_{out} b_i), &i \notin S, \quad q_{out} \sim \text{Gamma}(\alpha_{out}, \beta_{out}). \end{cases} \]
  • Marginal likelihoods are based on the gamma-Poisson model
  • The model is conjugate, so the marginal likelihoods have closed-form solutions
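As a worked example of the closed form, the null marginal likelihood can be written out as below. The derivation assumes the gamma prior is parameterized by shape \(\alpha_{all}\) and rate \(\beta_{all}\), and introduces the shorthand \(C = \sum_i z_i\) and \(B = \sum_i b_i\) for the total count and total baseline.

\[ P(D|H_0) = \int_0^\infty \prod_i \frac{(q_{all} b_i)^{z_i} e^{-q_{all} b_i}}{z_i!} \cdot \frac{\beta_{all}^{\alpha_{all}}}{\Gamma(\alpha_{all})} q_{all}^{\alpha_{all} - 1} e^{-\beta_{all} q_{all}} \, dq_{all} = \left( \prod_i \frac{b_i^{z_i}}{z_i!} \right) \frac{\beta_{all}^{\alpha_{all}} \, \Gamma(\alpha_{all} + C)}{\Gamma(\alpha_{all}) \, (\beta_{all} + B)^{\alpha_{all} + C}} \]

The leading product does not depend on the hypothesis, so it cancels whenever marginal likelihoods are compared. Under \(H_1(S)\) the same expression is applied separately to the locations inside and outside \(S\), with the corresponding priors, counts, and baselines.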

Bayesian Spatial Scan Statistic Testing

  • Using the marginal likelihoods from the models, the posterior probability of the null hypothesis is \[P(H_0 | D) = \frac{P(D|H_0) P(H_0)}{P(D)}\]
  • The posterior probability under the alternative is \[P(H_1(S) | D) = \frac{P(D|H_1(S)) P(H_1(S))}{P(D)}\]
  • Then we can return regions with non-negligible posterior probabilities
  • Since we have the full posterior probability distribution over hypotheses, there is no need for randomization testing
  • Bayes factors can be used to provide a direct measure of evidence for one hypothesis over the other
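A minimal sketch of this calculation for the conjugate model is shown below. It assumes shape/rate gamma priors, a prior probability `prior_null` for \(H_0\), and a uniform prior over candidate zones for the remainder; the function names, priors, and data are hypothetical. Terms shared by every hypothesis are dropped because they cancel in the normalization.

```python
import math


def log_marginal(count, baseline, alpha, beta):
    """Log gamma-Poisson marginal likelihood, up to a hypothesis-free constant."""
    return (alpha * math.log(beta) + math.lgamma(alpha + count)
            - math.lgamma(alpha) - (alpha + count) * math.log(beta + baseline))


def posterior_probabilities(z, b, zones, priors, prior_null=0.5):
    """Posterior probability of H0 and of H1(S) for every candidate zone S."""
    total_c, total_b = sum(z), sum(b)
    log_post = {"H0": math.log(prior_null)
                + log_marginal(total_c, total_b, *priors["all"])}
    log_prior_zone = math.log((1.0 - prior_null) / len(zones))
    for zone in zones:
        c_in = sum(z[i] for i in zone)
        b_in = sum(b[i] for i in zone)
        log_post[zone] = (log_prior_zone
                          + log_marginal(c_in, b_in, *priors["in"])
                          + log_marginal(total_c - c_in, total_b - b_in,
                                         *priors["out"]))
    # Normalize on the log scale to avoid underflow.
    top = max(log_post.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_post.values()))
    return {h: math.exp(v - log_norm) for h, v in log_post.items()}


# Hypothetical data and priors.
z = [30, 5, 6, 4]
b = [10.0, 10.0, 10.0, 10.0]
zones = [frozenset(s) for s in ({0}, {1}, {2}, {3}, {0, 1}, {1, 2}, {2, 3})]
priors = {"all": (1.0, 1.0), "in": (1.0, 1.0), "out": (1.0, 1.0)}
post = posterior_probabilities(z, b, zones, priors)
best = max(post, key=post.get)
```

Regions with non-negligible posterior probability can then be read directly from `post`.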

Bayesian Interpretation

Interpretation of Bayes factor

| BF | Log(BF) | Strength of evidence for \(H_1(S)\) |
|---|---|---|
| 1 to 3.2 | 0 to 1.16 | Not Significant |
| 3.2 to 10 | 1.16 to 2.30 | Positive |
| 10 to 100 | 2.30 to 4.61 | Strong |
| \(>\) 100 | \(> 4.61\) | Decisive |
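The scale above can be applied to a log Bayes factor with a small helper. This is only an illustrative sketch: the function name is hypothetical, boundary values are assigned to the higher category, and the label for negative log Bayes factors is an assumption because the scale starts at BF = 1.

```python
import math


def evidence_category(log_bf):
    """Map a natural-log Bayes factor for H1(S) to the categories of the scale."""
    if log_bf < 0:
        return "Favors H0"  # assumption: not covered by the scale
    if log_bf < math.log(3.2):
        return "Not Significant"
    if log_bf < math.log(10):
        return "Positive"
    if log_bf < math.log(100):
        return "Strong"
    return "Decisive"


labels = [evidence_category(v) for v in (1.0, 2.0, 3.0, 714.0)]
```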

Proposed Method

Model

  • We propose a novel Bayesian spatial scan statistic model by modeling the true counts as a latent variable and introducing reporting probability \(p\).
  • Our spatial scan statistic is based on the hierarchical model \[ z_i \sim \text{Poisson}(p \times q \times b_i) \\ q \sim \text{gamma}(\alpha, \beta) \\ p \sim \text{beta}(\alpha_p, \beta_p) \]
  • Model no longer conjugate
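The latent-variable reading of this model can be illustrated by simulation. The sketch assumes each true case is reported independently with probability \(p\), in which case thinning a Poisson(\(q b_i\)) true count gives a Poisson(\(p q b_i\)) reported count. The parameter values and function names are hypothetical, and the Poisson sampler is a simple textbook method suitable only for small means.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_reported(rng, q, b, p):
    """Return (true count, reported count) for one location."""
    true_count = poisson_draw(rng, q * b)  # latent true count
    # Each true case is reported independently with probability p.
    reported = sum(rng.random() < p for _ in range(true_count))
    return true_count, reported


rng = random.Random(1)
q, b, p = 2.0, 10.0, 0.3  # hypothetical rate, baseline, reporting probability
draws = [simulate_reported(rng, q, b, p)[1] for _ in range(20000)]
mean_reported = sum(draws) / len(draws)
var_reported = sum((d - mean_reported) ** 2 for d in draws) / (len(draws) - 1)
```

Both the sample mean and the sample variance of the reported counts should be close to \(p \times q \times b_i = 6\), as expected for a Poisson distribution.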

Bayesian Spatial Scan Statistic Extension

  • The new null hypothesis assumes no clusters \[ z_i \sim \text{Poisson}(p \times q_{all} \times b_i), \quad q_{all} \sim \text{gamma}(\alpha_{all}, \beta_{all}), \quad p \sim \text{beta}(\alpha, \beta) \]
  • The resulting alternative hypothesis for candidate region \(S\) is \[ \begin{cases} z_i \sim \text{Poisson}(p \times q_{in} \times b_i), &i \in S, \quad q_{in} \sim \text{Gamma}(\alpha_{in}, \beta_{in}), \\ z_i \sim \text{Poisson}(p \times q_{out} \times b_i), &i \notin S, \quad q_{out} \sim \text{Gamma}(\alpha_{out}, \beta_{out}). \end{cases} \\ p \sim \text{beta}(\alpha, \beta) \]

Setting Priors

  • An informative prior on the reporting rate \(p\) is necessary because the likelihood depends on \(p\) and \(q\) only through their product
    • Historical data
    • Expert elicitation
  • A diffuse prior can be set on the \(q\) parameters

Posterior Estimation

  • The marginal likelihood of the null hypothesis is now: \[ P(D|H_0) = \int \int \pi(q_{all}) \times \pi(p) \times \prod_{i \in G} P(D|q_{all}, p, b_i) dq_{all} dp \]
  • The marginal likelihood under a candidate region \(S\) is now: \[ P(D|H_1(S)) = \int \int \pi(q_{in}) \times \pi(p) \times \prod_{i \in S} P(D|q_{in}, p, b_i) dq_{in} dp \times \\ \quad \int \int \pi(q_{out}) \times \pi(p) \times \prod_{i \in G \setminus S} P(D|q_{out}, p, b_i) dq_{out} dp \]
  • Posterior samples are obtained through MCMC sampling in Stan
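As a possible cross-check on sampling-based estimates (not the approach used in this talk), the null integral can be reduced to one dimension. Writing \(C = \sum_i z_i\) and \(B = \sum_i b_i\), the inner integral over \(q_{all}\) is a gamma integral, \(\int_0^\infty q^{\alpha_{all} + C - 1} e^{-q(\beta_{all} + pB)} dq = \Gamma(\alpha_{all} + C) / (\beta_{all} + pB)^{\alpha_{all} + C}\), which leaves a single integral over \(p\). The sketch below evaluates that integral with a midpoint rule on the log scale, dropping the same hypothesis-free constant as the proportional form of the marginal likelihood; the function and argument names are hypothetical.

```python
import math


def log_marginal_null(C, B, a_q, b_q, a_p, b_p, n_grid=20000):
    """Log of the H0 marginal likelihood, up to a hypothesis-free constant.

    a_q, b_q: shape and rate of the gamma prior on q_all.
    a_p, b_p: parameters of the beta prior on the reporting rate p.
    """
    log_beta_fn = math.lgamma(a_p) + math.lgamma(b_p) - math.lgamma(a_p + b_p)
    log_const = (a_q * math.log(b_q) + math.lgamma(a_q + C)
                 - math.lgamma(a_q) - log_beta_fn)
    h = 1.0 / n_grid
    logs = []
    for k in range(n_grid):
        p = (k + 0.5) * h  # midpoints avoid log(0) at the endpoints
        logs.append((a_p + C - 1) * math.log(p)
                    + (b_p - 1) * math.log1p(-p)
                    - (a_q + C) * math.log(b_q + p * B))
    top = max(logs)
    return log_const + top + math.log(sum(math.exp(v - top) for v in logs) * h)


# Check against a case with a known answer: no cases, uniform prior on p,
# q_all ~ gamma(2, 1), B = 3 gives an exact marginal likelihood of 0.25.
check = log_marginal_null(C=0, B=3.0, a_q=2.0, b_q=1.0, a_p=1.0, b_p=1.0)
```

The same reduction applies to each of the two factors under \(H_1(S)\), using the counts and baselines inside and outside the candidate region.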

Decision Making

  • Decisions should be based on the estimated risk ratio between the inside and the outside of the candidate cluster. \[ \widehat{RR} = \frac{\widehat{q_{in}}}{\widehat{q_{out}}} \]
  • Bayes factors provide evidence for the alternative hypothesis
    • calculated using bridge sampling in R (Gronau, Singmann, and Wagenmakers n.d.)
  • The most likely cluster is selected based on the largest risk ratio and Bayes factor
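Given posterior draws of \(q_{in}\) and \(q_{out}\), the risk ratio estimate can be computed as sketched below. The sketch assumes that \(\widehat{q_{in}}\) and \(\widehat{q_{out}}\) are posterior means, and the posterior probability that the inside rate exceeds the outside rate is an extra summary added here for illustration. The draws are synthetic stand-ins for MCMC output.

```python
import random


def risk_ratio_summary(q_in_draws, q_out_draws):
    """Return (ratio of posterior means, share of draws with q_in > q_out)."""
    mean_in = sum(q_in_draws) / len(q_in_draws)
    mean_out = sum(q_out_draws) / len(q_out_draws)
    elevated = sum(a > b for a, b in zip(q_in_draws, q_out_draws))
    return mean_in / mean_out, elevated / len(q_in_draws)


# Synthetic draws standing in for posterior samples (shape, scale).
rng = random.Random(2)
q_in_draws = [rng.gammavariate(50.0, 0.06) for _ in range(4000)]
q_out_draws = [rng.gammavariate(50.0, 0.02) for _ in range(4000)]
rr_hat, prob_elevated = risk_ratio_summary(q_in_draws, q_out_draws)
```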

Simulation Study

Simulation Design

  • 39 counties of Washington state with an outbreak in 3 counties in southeastern Washington
  • Baseline values were determined by 100,000 total cases to start
  • 50 simulated data sets for each set of parameters
    • Reporting rate: 0.1, 0.2, 0.3, 0.4, and 0.5
    • Outbreak effect (\(\Delta = q_{in} - q_{out}\)): 0.15, 0.20, 0.25, 0.30, 1.0, and 3.01

Simulation priors

  • Priors for the reporting rate \(p\) were set using the betabuster tool in the epiR package \[ p \sim \text{Beta}(3.5, 23) \quad \text{if} \quad p = 0.1 \\ p \sim \text{Beta}(4.5, 15) \quad \text{if} \quad p = 0.2 \\ p \sim \text{Beta}(10, 22) \quad \text{if} \quad p = 0.3 \\ p \sim \text{Beta}(13, 19) \quad \text{if} \quad p = 0.4 \\ p \sim \text{Beta}(1, 1) \quad \text{if} \quad p = 0.5 \]
  • Priors for \(q\) \[ q_{all} \sim \text{gamma}(2, 0.5) \\ q_{out} \sim \text{gamma}(2, 0.4) \\ q_{in} \sim \text{gamma}(2, 0.5) \]
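The informative reporting-rate priors can be sanity-checked by computing their modes and tail probabilities. The sketch below uses the mode formula \((a - 1)/(a + b - 2)\), valid for \(a, b > 1\), and a midpoint rule for the upper tail; the helper names are hypothetical and this is not the betabuster procedure itself.

```python
import math


def beta_mode(a, b):
    """Mode of a Beta(a, b) distribution, valid for a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)


def beta_upper_tail(a, b, x, n_grid=20000):
    """P(p > x) for p ~ Beta(a, b), by midpoint integration of the density."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = (1.0 - x) / n_grid
    total = 0.0
    for k in range(n_grid):
        p = x + (k + 0.5) * h
        total += math.exp(log_norm + (a - 1) * math.log(p)
                          + (b - 1) * math.log1p(-p))
    return total * h


# Modes of the informative priors listed for the simulation study.
modes = {ab: beta_mode(*ab) for ab in [(3.5, 23), (4.5, 15), (10, 22), (13, 19)]}
# Prior probability that the reporting rate exceeds 30% under Beta(3.5, 23).
tail_30 = beta_upper_tail(3.5, 23, 0.30)
```

Each informative prior has its mode at, or very near, the reporting rate it is paired with, and Beta(3.5, 23) places only a small probability on reporting rates above 30%.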

Simulation Metrics

Even when the null hypothesis is correctly rejected, the detected clusters rarely match the true cluster exactly.

To evaluate how well they overlap we will use:

  • Power: Proportion of detected clusters that exactly match the true cluster
  • Sensitivity: Proportion of true cases correctly included
  • Positive Predictive Value (PPV): Proportion of detected cases that are actually in the true cluster
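These metrics can be computed across simulated data sets as sketched below. The sketch assumes overlap is measured by counting locations (counties) rather than individual cases, and averages sensitivity and PPV over data sets; the function name and example clusters are hypothetical.

```python
def cluster_metrics(true_cluster, detected_clusters):
    """Return (power, sensitivity, PPV) averaged over simulated data sets.

    true_cluster: set of location ids in the true cluster.
    detected_clusters: one non-empty detected set of location ids per data set.
    """
    true_cluster = set(true_cluster)
    n = len(detected_clusters)
    exact = sens = ppv = 0.0
    for detected in detected_clusters:
        detected = set(detected)
        overlap = len(detected & true_cluster)
        exact += detected == true_cluster      # exact match
        sens += overlap / len(true_cluster)    # share of true cluster found
        ppv += overlap / len(detected)         # share of detection that is true
    return exact / n, sens / n, ppv / n


# Hypothetical results from three simulated data sets.
power, sensitivity, ppv = cluster_metrics(
    {1, 2, 3}, [{1, 2, 3}, {1, 2}, {2, 3, 4, 5}]
)
```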

Simulation Results Visual

Application: Texas COVID-19 Data

Texas COVID-19 Data

  • COVID-19 data in early 2020 were severely under-reported due to limited testing and difficulty of diagnosis (Hortaçsu, Liu, and Schwieg 2021)
  • Data (254 Counties)
    • COVID-19 cases (Probable and Confirmed)
    • Population

Real Data (priors)

  • Estimates from early COVID-19 studies suggest very low reporting rates (\(\approx 10\%\)), with a low probability of exceeding 30\(\%\) (Chen, Song, and Stamey 2022).
  • This information results in a prior of \(p \sim \text{Beta}(3.5, 23)\)
  • Diffuse priors were set for the \(q_\cdot\) parameters

\[ q_{all} \sim \text{gamma}(1.5, 1) \\ q_{out} \sim \text{gamma}(1.5, 1) \\ q_{in} \sim \text{gamma}(1.5, 1) \]

Real Data Results

The two methods identify different most likely clusters:

  • Naive: Around the city of Houston
  • Under-reported: Around El Paso and Texas panhandle.

The Bayes factors for the identified clusters are very large, indicating decisive evidence in favor of \(H_1\) over \(H_0\).

| Method | log BF | RR |
|---|---|---|
| Naive 1 | 875,698.5 | NA* |
| Under-reported 1 | 714.0 | 3.06 |
| Under-reported 2 | 292.8 | 2.71 |

* Not provided in output from ScanStatistic package.

Discussion

  • Traditional scan statistics may fail when case counts are under-reported
  • The proposed method models reporting probability, improving cluster detection under incomplete data
  • In both the simulation study and the real data example, the proposed method outperforms the naive Bayesian scan statistic
  • The proposed method performs significantly better when the effect is small and reporting is low
    • Provided that a reasonable informative prior is available

Future work

  • Extend to spatiotemporal model for real-time detection
  • Incorporate multivariate outcomes
  • Allow spatially varying rates to reflect local testing access

Thank you!

Bibliography

Chen, Jinjie, Joon Jin Song, and James D. Stamey. 2022. “A Bayesian Hierarchical Spatial Model to Correct for Misreporting in Count Data: Application to State-Level COVID-19 Data in the United States.” International Journal of Environmental Research and Public Health 19 (6): 3327. https://doi.org/10.3390/ijerph19063327.
Gronau, Quentin F, Henrik Singmann, and Eric-Jan Wagenmakers. n.d. “Bridgesampling: An R Package for Estimating Normalizing Constants.”
Hortaçsu, Ali, Jiarui Liu, and Timothy Schwieg. 2021. “Estimating the Fraction of Unreported Infections in Epidemics with a Known Epicenter: An Application to COVID-19.” Journal of Econometrics, Pandemic Econometrics, 220 (1): 106–29. https://doi.org/10.1016/j.jeconom.2020.07.047.
Kulldorff, Martin. 1997. “A Spatial Scan Statistic.” Communications in Statistics - Theory and Methods 26 (6): 1481–96. https://doi.org/10.1080/03610929708831995.
Kulldorff, Martin, Farzad Mostashari, Luiz Duczmal, W. Katherine Yih, Ken Kleinman, and Richard Platt. 2007. “Multivariate Scan Statistics for Disease Surveillance.” Statistics in Medicine 26 (8): 1824–33. https://doi.org/10.1002/sim.2818.
Neill, Daniel B. 2024. “Bayesian Scan Statistics.” In Handbook of Scan Statistics, edited by Joseph Glaz and Markos V. Koutras, 83–103. New York, NY: Springer. https://doi.org/10.1007/978-1-4614-8033-4_28.
Neill, Daniel, Andrew Moore, and Gregory Cooper. 2005. “A Bayesian Spatial Scan Statistic.” In Advances in Neural Information Processing Systems. Vol. 18. MIT Press. https://proceedings.neurips.cc/paper_files/paper/2005/hash/28acfe2da49d2b9a7f177458256f2540-Abstract.html.
Shao, Kan, Yandong Liu, and Daniel B. Neill. 2011. “A Generalized Fast Subset Sums Framework for Bayesian Event Detection.” In 2011 IEEE 11th International Conference on Data Mining, 617–25. https://doi.org/10.1109/ICDM.2011.11.

Appendix: MCMC convergence (Sim)

Null samples from simulation

Alternative samples from simulation

Appendix: MCMC convergence (COVID-19)

Null samples from simulation

Alternative samples from simulation

Appendix: Priors

Appendix: Equations

\[ P(D|H_0) \propto \frac{\beta_{all}^{\alpha_{all}}}{\Gamma(\alpha_{all}) \mathcal{B}(\alpha, \beta)} \int \int q_{all}^{\alpha_{all} + C - 1} p^{\alpha + C - 1} (1 - p)^{\beta - 1} e^{-q_{all}\beta_{all} - p q_{all} B} dp dq_{all} \]

\[ P(D|H_1(S)) \propto \frac{\beta_{in}^{\alpha_{in}}}{\Gamma(\alpha_{in}) \mathcal{B}(\alpha, \beta)} \int \int q_{in}^{\alpha_{in} + C_S - 1} p^{\alpha + C_S - 1} (1 - p)^{\beta - 1} e^{-q_{in}\beta_{in} - p q_{in} B_S} dp dq_{in} \times \\ \quad \frac{\beta_{out}^{\alpha_{out}}}{\Gamma(\alpha_{out}) \mathcal{B}(\alpha, \beta)} \int \int q_{out}^{\alpha_{out} + C_{G \setminus S} - 1} p^{\alpha + C_{G \setminus S} - 1} (1 - p)^{\beta - 1} e^{-q_{out}\beta_{out} - p q_{out} B_{G \setminus S}} dp dq_{out} \]

Here \(C = \sum_{i \in G} z_i\) and \(B = \sum_{i \in G} b_i\) are the total count and total baseline over all locations \(G\); the subscripts \(S\) and \(G \setminus S\) restrict the sums to locations inside and outside the candidate region.

\[ P(D) = P(D|H_0) P(H_0) + \sum_{S \in \mathcal{S}} P(D|H_1(S)) P(H_1(S)) \]